Abstract

General-purpose, intelligent, learning agents cycle through sequences of 
observations, actions, and rewards that are complex, uncertain, unknown, and 
non-Markovian. On the other hand, reinforcement learning is well-developed
for small finite state Markov decision processes (MDPs). Up to now, extract- 
ing the right state representations out of bare observations, that is, reducing 
the general agent setup to the MDP framework, is an art that involves signif- 
icant effort by designers. The primary goal of this work is to automate the 
reduction process and thereby significantly expand the scope of many existing 
reinforcement learning algorithms and the agents that employ them. Before 
we can think of mechanizing this search for suitable MDPs, we need a formal 
objective criterion. The main contribution of this article is to develop such a 
criterion. I also integrate the various parts into one learning algorithm. Ex- 
tensions to more realistic dynamic Bayesian networks are developed in Part 
II [Hut09c]. The role of POMDPs is also considered there.
"Approximations, after all, may be made in two places - in the construction
of the model and in the solution of the associated equations. It is
not at all clear which yields a more judicious approximation."

- Richard Bellman (1961)



1 Introduction 

Background & motivation. Artificial General Intelligence (AGI) is concerned
with designing agents that perform well in a wide range of environments [GP07,
LH07]. Among the well-established "narrow" Artificial Intelligence (AI) approaches
[RN03], arguably Reinforcement Learning (RL) [SB98] pursues most directly the
same goal. RL considers the general agent-environment setup in which an agent
interacts with an environment (acts and observes in cycles) and receives (occasional)
rewards. The agent's objective is to collect as much reward as possible. Most if not
all AI problems can be formulated in this framework. Since the future is generally
unknown and uncertain, the agent needs to learn a model of the environment based
on past experience, which allows it to predict future rewards and use this to maximize
expected long-term reward.

The simplest interesting environmental class consists of finite state fully observable
Markov Decision Processes (MDPs) [Put94, SB98], which is reasonably well
understood. Extensions to continuous states with (non)linear function approximation
[SB98, Gor99], partial observability (POMDP) [KLC98, RPPCd08], structured
MDPs (DBNs) [SDL07], and others have been considered, but the algorithms are
much more brittle.

A way to tackle complex real-world problems is to reduce them to finite MDPs 
which we know how to deal with efficiently. This approach leaves a lot of work to 
the designer, namely to extract the right state representation ("features") out of 
the bare observations in the initial (formal or informal) problem description. Even 
if potentially useful representations have been found, it is usually not clear which 
ones will turn out to be better, except in situations where we already know a perfect 
model. Think of a mobile robot equipped with a camera plunged into an unknown 
environment. While we can imagine which image features will potentially be useful, 
we cannot know in advance which ones will actually be useful. 

Main contribution. The primary goal of this paper is to develop and investigate 
a method that automatically selects those features that are necessary and sufficient 
for reducing a complex real-world problem to a computationally tractable MDP. 

Formally, we consider maps Φ from the past observation-reward-action history h
of the agent to an MDP state. Histories not worth being distinguished are mapped
to the same state, i.e. Φ^{-1} induces a partition on the set of histories. We call this
model ΦMDP. A state may be simply an abstract label of the partition, but more
often is itself a structured object like a discrete vector. Each vector component
describes one feature of the history [Hut09a, Hut09c]. For example, the state may
be a 3-vector containing (shape, color, size) of the object a robot tracks. For this
reason, we call the reduction Feature RL, although in this Part I only the simpler
unstructured case is considered.

Φ maps the agent's experience over time into a sequence of MDP states. Rather
than informally constructing Φ by hand, our goal is to develop a formal objective
criterion Cost(Φ|h) for evaluating different reductions Φ. Obviously, at any point
in time, if we want the criterion to be effective it can only depend on the agent's
past experience h and possibly generic background knowledge. The "Cost" of Φ
shall be small iff it leads to a "good" MDP representation. The establishment
of such a criterion transforms the, in general, ill-defined RL problem to a formal
optimization problem (minimizing Cost) for which efficient algorithms need to be
developed. Another important question is which problems can profitably be reduced
to MDPs [Hut09a, Hut09c].

The real world does not conform itself to nice models: Reality is a non-ergodic
partially observable uncertain unknown environment in which acquiring experience
can be expensive. So we should exploit the data (past experience) at hand as well
as possible. We cannot generate virtual samples since the model is not given (it needs
to be learned itself), and there is no reset-option. No criterion for this general setup
exists. Of course, there is previous work which is in one way or another related to
ΦMDP.

ΦMDP in perspective. As partly detailed later, the suggested ΦMDP model has
interesting connections to many important ideas and approaches in RL and beyond:

• ΦMDP side-steps the open problem of learning POMDPs [KLC98],

• Unlike Bayesian RL algorithms [DFA99, Duf02, PVHR06, RP08], ΦMDP
avoids learning a (complete stochastic) observation model,

• ΦMDP is a scaled-down practical instantiation of AIXI [Hut05, Hut07],

• ΦMDP extends the idea of state-aggregation from planning (based on
bisimulation metrics [GDG03]) to RL (based on information),

• ΦMDP generalizes U-Tree [McC96] to arbitrary features,

• ΦMDP extends model selection criteria to general RL problems [Gru07],

• ΦMDP is an alternative to PSRs [SLJ+03] for which proper learning algorithms
have yet to be developed,

• ΦMDP extends feature selection from supervised learning to RL [GE03].

Learning in agents via rewards is a much more demanding task than "classical"
machine learning on independently and identically distributed (i.i.d.) data, largely due
to the temporal credit assignment and exploration problem. Nevertheless, RL (and
the closely related adaptive control theory in engineering) has been applied (often
unrivaled) to a variety of real-world problems, occasionally with stunning success
(Backgammon, Checkers [SB98, Chp.11], helicopter control [NCD+04]). ΦMDP
overcomes several of the limitations of the approaches in the items above and thus
broadens the applicability of RL.
ΦMDP owes its general-purpose learning and planning ability to its information
and complexity theoretical foundations. The implementation of ΦMDP is based on
(specialized and general) search and optimization algorithms used for finding good
reductions Φ. Given that ΦMDP aims at general AI problems, one may wonder
about the role of other aspects traditionally considered in AI [RN03]: knowledge
representation (KR) and logic may be useful for representing complex reductions
Φ(h). Agent interface fields like robotics, computer vision, and natural language
processing can speed up learning by pre- and post-processing the raw observations and
actions into more structured formats. These representational and interface aspects
will only barely be discussed in this paper. The following diagram illustrates ΦMDP
in perspective.

[Figure: ΦMDP in perspective. Top: Universal AI (AIXI). Center: ΦMDP / ΦDBN,
resting on Information, Learning, Planning, and Complexity. Below: Search,
Optimization, Computation, Logic, KR. Base: Agents = Framework,
Interface = Robots, Vision, Language.]



Contents. Section 2 formalizes our ΦMDP setup, which consists of the agent model
with a map Φ from observation-reward-action histories to MDP states. Section 3
develops our core Φ selection principle, which is illustrated in Section 4 on a tiny
example. Section 5 discusses general search algorithms for finding (approximations
of) the optimal Φ, concretized for context tree MDPs. In Section 6 I find the
optimal action for ΦMDP, and present the overall algorithm. Section 7 improves
the Φ selection criterion by "integrating" out the states. Section 8 contains a brief
discussion of ΦMDP, including relations to prior work, incremental algorithms, and
an outlook to more realistic structured MDPs (dynamic Bayesian networks, ΦDBN)
treated in Part II.

Rather than leaving parts of ΦMDP vague and unspecified, I decided to give at
the very least a simplistic concrete algorithm for each building block, which may be
assembled into one sound system on which one can build.

Notation. Throughout this article, log denotes the binary logarithm, ε the empty
string, and δ_{x,y} = δ_{xy} = 1 if x = y and 0 else is the Kronecker symbol. I generally
omit separating commas if no confusion arises, in particular in indices. For any x
of suitable type (string, vector, set), I define string x = x_{1:l} = x_1...x_l, sum x_+ = Σ_j x_j,
union x_* = ∪_j x_j, and vector x_• = (x_1,...,x_l), where j ranges over the full range {1,...,l}
and l = |x| is the length or dimension or size of x. x̂ denotes an estimate of x. P(·)
denotes a probability over states and rewards or parts thereof. I do not distinguish
between random variables X and realizations x, and the abbreviation P(x) := P[X = x]
never leads to confusion. More specifically, m ∈ ℕ denotes the number of states,
i ∈ {1,...,m} any state index, n ∈ ℕ the current time, and t ∈ {1,...,n} any time in
history. Further, in order not to get distracted, at several places I gloss over initial
conditions or special cases where inessential. Also 0*undefined = 0*infinity := 0.



2 Feature Markov Decision Process (ΦMDP)

This section describes our formal setup. It consists of the agent-environment
framework and maps Φ from observation-reward-action histories to MDP states. I call
this arrangement "Feature MDP" or ΦMDP for short.

Agent-environment setup. I consider the standard agent-environment setup
[RN03] in which an Agent interacts with an Environment. The agent can choose
from actions a ∈ A (e.g. limb movements) and the environment provides (regular)
observations o ∈ O (e.g. camera images) and real-valued rewards r ∈ R ⊆ ℝ to the
agent. The reward may be very scarce, e.g. just +1 (-1) for winning (losing) a chess
game, and 0 at all other times [Hut05, Sec.6.3]. This happens in cycles t = 1,2,3,...:
At time t, after observing o_t and receiving reward r_t, the agent takes action a_t based
on history h_t := o_1 r_1 a_1 ... o_{t-1} r_{t-1} a_{t-1} o_t r_t. Then the next cycle t+1 starts. The agent's
objective is to maximize his long-term reward. Without much loss of generality, I
assume that R is finite. Finiteness of R is lifted in [Hut09a, Hut09c]. I also
assume that A is finite and small, which is restrictive. Part II deals with large state
spaces, and large (structured) action spaces can be dealt with in a similar way. No
assumptions are made on O; it may be huge or even infinite. Indeed, ΦMDP has
been specifically designed to cope with huge observation spaces, e.g. camera images,
which are mapped to a small space of relevant states.

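The agent and environment may be viewed as a pair or triple of interlocking
functions of the history H := (O × A × R)* × O × R:

\[
\begin{aligned}
\mathrm{Env} &: \mathcal{H}\times\mathcal{A} \leadsto \mathcal{O}\times\mathcal{R}, & o_n r_n &= \mathrm{Env}(h_{n-1}a_{n-1}),\\
\mathrm{Agent} &: \mathcal{H} \leadsto \mathcal{A}, & a_n &= \mathrm{Agent}(h_n),
\end{aligned}
\]

[Figure: agent-environment loop. Env passes observation and reward to Agent;
Agent passes action back to Env.]

where ⇝ indicates that mappings → might be stochastic.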

The goal of AI is to design agents that achieve high (expected) reward over the 
agent's lifetime. 

(Un)known environments. For known Env(), finding the reward maximizing
agent is a well-defined and formally solvable problem [Hut05, Chp.4], with
computational efficiency being the "only" matter of concern. For most real-world AI
problems Env() is at best partially known. For unknown Env(), the meaning of
expected reward maximizing is even conceptually a challenge [Hut05, Chp.5].

Narrow AI considers the case where function Env() is either known (like planning
in blocks world), or essentially known (like in chess, where one can safely model the
opponent as a perfect minimax player), or Env() belongs to a relatively small class
of environments (e.g. elevator or traffic control).

The goal of AGI is to design agents that perform well in a large range of
environments [LH07], i.e. achieve high reward over their lifetime with as few
assumptions about Env() as possible. A minimal necessary assumption is that the
environment possesses some structure or pattern [WM97].

From real-life experience (and from the examples below) we know that usually we
do not need to know the complete history of events in order to determine (sufficiently
well) what will happen next and to be able to perform well. Let Φ(h) be such a
"useful" summary of history h.

Generality of ΦMDP. The following examples show that many problems can be
reduced (approximately) to finite MDPs, thus showing that ΦMDP can deal with
a large variety of problems: In full-information games (like chess) with a static
opponent, it is sufficient to know the current state of the game (board configuration)
to play well (the history plays no role), hence Φ(h_t) = o_t is a sufficient summary
(Markov condition). Classical physics is essentially predictable from the position
and velocity of objects at a single time, or equivalently from the locations at two
consecutive times, hence Φ(h_t) = o_{t-1} o_t is a sufficient summary (2nd order Markov).
For i.i.d. processes of unknown probability (e.g. clinical trials ≃ Bandits), the
frequency of observations Φ(h_n) = (Σ_{t=1}^n δ_{o_t o})_{o∈O} is a sufficient statistic. In a POMDP
planning problem, the so-called belief vector at time t can be written down explicitly
as some function of the complete history h_t (by integrating out the hidden states).
Φ(h_t) could be chosen as (a discretized version of) this belief vector, showing that
ΦMDP generalizes POMDPs. Obviously, the identity Φ(h) = h is always sufficient
but not very useful, since Env() as a function of H is hard to impossible to "learn".

This suggests looking for Φ with small codomain, which allows us to
learn/estimate/approximate Env by \hat{Env} such that o_t ≈ \hat{Env}(Φ(h_{t-1})) for t = 1...n.
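
The example summaries above can be written down as ordinary functions of the history. The following Python sketch is only an illustration: the representation of a history as a list of (observation, reward, action) triples and all function names are assumptions made here, not notation from the text.

```python
from collections import Counter

# Assumption: a history is a list of (observation, reward, action) triples;
# the action of the most recent step may still be None.

def phi_markov(history):
    """Phi(h_t) = o_t: the last observation is the state (Markov condition)."""
    return history[-1][0]

def phi_last_k(history, k=2):
    """Phi(h_t) = o_{t-k+1}...o_t: the last k observations (k-th order Markov)."""
    obs = [o for (o, _r, _a) in history]
    return tuple(obs[-k:]) if k > 0 else ()

def phi_counts(history, alphabet):
    """Phi(h_n) = observation counts, a sufficient statistic for i.i.d. data."""
    counts = Counter(o for (o, _r, _a) in history)
    return tuple(counts[o] for o in alphabet)

def phi_identity(history):
    """Phi(h) = h: always sufficient, but no state ever repeats."""
    return tuple(history)

h = [(0, 0, 'a'), (1, 1, 'a'), (1, 3, None)]
print(phi_markov(h), phi_last_k(h, k=2), phi_counts(h, alphabet=(0, 1)))
```

Each function maps many different histories to the same value, which is what makes the codomain small; phi_identity is the degenerate case in which nothing is merged.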

Example. Consider a robot equipped with a camera, i.e. o is a pixel image.
Computer vision algorithms usually extract a set of features from o_{t-1} (or h_{t-1}), from
low-level patterns to high-level objects with their spatial relation. It is neither
possible nor necessary to make a precise prediction of o_t from summary Φ(h_{t-1}). An
approximate prediction must and will do. The difficulty is that the similarity
measure "≈" needs to be context dependent. Minor image nuances are irrelevant when
driving a car, but when buying a painting it makes a huge difference in price whether
it's an original or a copy. Essentially only a bijection Φ would be able to extract all
potentially interesting features, but such a Φ defeats its original purpose.

From histories to states. It is of utmost importance to properly formalize the
meaning of "≈" in a general, domain-independent way. Let s_t := Φ(h_t) summarize all
relevant information in history h_t. I call s a state or feature (vector) of h. "Relevant"
means that the future is predictable from s_t (and a_t) alone, and that the relevant
future is coded in s_{t+1} s_{t+2}.... So we pass from the complete (and known) history
o_1 r_1 a_1 ... o_n r_n a_n to a "compressed" history sra_{1:n} ≡ s_1 r_1 a_1 ... s_n r_n a_n and seek Φ such
that s_{t+1} is (approximately a stochastic) function of s_t (and a_t). Since the goal of
the agent is to maximize his rewards, the rewards r_t are always relevant, so they
(have to) stay untouched (this will become clearer below).

The ΦMDP. The structure derived above is a classical Markov Decision Process
(MDP), but the primary question I ask is not the usual one of finding the value
function or best action or comparing different models of a given state sequence. I
ask how well the state-action-reward sequence generated by Φ can be modeled as an
MDP compared to other sequences resulting from different Φ. A good Φ leads to a
good model for predicting future rewards, which can be used to find good actions
that maximize the agent's expected long-term reward.



3 ΦMDP Coding and Evaluation

I first review a few standard codes and model selection methods for i.i.d. sequences,
subsequently adapt them to our situation, and show that they are suitable in our
context. I state my Cost function for Φ, and the Φ selection principle, and compare
it to the Minimum Description Length (MDL) philosophy.

I.i.d. processes. Consider i.i.d. x_1...x_n ∈ X^n for finite X = {1,...,m}. For known
θ_i = P[x_t = i] we have P(x_{1:n}|θ) = θ_{x_1} · ... · θ_{x_n}. It is well-known that there exists
a code (e.g. arithmetic or Shannon-Fano) for x_{1:n} of length -log P(x_{1:n}|θ), which
is asymptotically optimal with probability one [Bar85, Thm.3.1]. This also easily
follows from [CT06, Thm.5.10.1].

MDL/MML code [Gru07, Wal05]: For unknown θ we may use a frequency
estimate θ̂_i = n_i/n, where n_i = |{t ≤ n : x_t = i}|. Then it is easy to see that
-log P(x_{1:n}|θ̂) = n H(θ̂), where

\[
H(\hat\theta) := -\sum_{i=1}^{m} \hat\theta_i \log \hat\theta_i \quad\text{is the entropy of } \hat\theta
\]

(0 log 0 := 0 =: 0 log(0/0)). We also need to code θ̂, or equivalently (n_i), which naively needs
log n bits for each i. In general, a sample size of n allows estimating parameters only
to accuracy O(1/√n), which is essentially equivalent to the fact that
log P(x_{1:n}|θ̂ ± O(1/√n)) - log P(x_{1:n}|θ̂) = O(1). This shows that it is sufficient to code each θ̂_i to
accuracy O(1/√n), which requires only (1/2) log n + O(1) bits each. Hence, given n and
ignoring O(1) terms, the overall code length (CL) of x_{1:n} for unknown frequencies is

\[
\mathrm{CL}(x_{1:n}) = \mathrm{CL}(\boldsymbol{n}) := n\,H(\boldsymbol{n}/n) + \frac{m-1}{2}\log n \quad\text{for } n>0 \text{ and } 0 \text{ else,} \tag{1}
\]

where \(\boldsymbol{n}\) = (n_1,...,n_m) and n = n_+ = n_1 + ... + n_m. We have assumed that n is given,
hence only m-1 of the n_i need to be coded, since the m-th one can be reconstructed
from them and n. The above is an exact code of x_{1:n}, which is optimal (within
+O(1)) for all i.i.d. sources. This code may further be optimized by only coding θ̂_i
for the m' = |{i : n_i > 0}| ≤ m non-empty categories, resulting in a code of length

\[
\mathrm{CL}'(\boldsymbol{n}) := n\,H(\boldsymbol{n}/n) + \frac{m'-1}{2}\log n + m, \tag{2}
\]

where the m bits are needed to indicate which of the θ̂_i are coded. We refer to this
improvement as sparse code.
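
Code lengths (1) and (2) are simple functions of the count vector. The following minimal Python sketch evaluates both; the function names are chosen here for illustration, and the value returned by the sparse code for an empty count vector is an assumption, since (2) is not stated for n = 0.

```python
import math

def entropy_bits(counts):
    """Entropy H(n/n) in bits of the empirical distribution of a count vector."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def code_length(counts):
    """Eq. (1): n*H(n/n) + (m-1)/2 * log2(n) for n > 0, and 0 else."""
    n, m = sum(counts), len(counts)
    if n == 0:
        return 0.0
    return n * entropy_bits(counts) + (m - 1) / 2 * math.log2(n)

def sparse_code_length(counts):
    """Eq. (2): code only the m' non-empty categories, plus m indicator bits."""
    n, m = sum(counts), len(counts)
    if n == 0:
        return float(m)  # assumption: only the indicator bits remain
    m_nonempty = sum(1 for c in counts if c > 0)
    return n * entropy_bits(counts) + (m_nonempty - 1) / 2 * math.log2(n) + m

counts = [512, 512, 0, 0, 0, 0, 0, 0]   # m = 8 categories, only 2 occur
print(code_length(counts), sparse_code_length(counts))
```

With two of eight categories occupied, the sparse code pays (m'-1)/2 = 1/2 parameter instead of (m-1)/2 = 7/2, at the price of m = 8 indicator bits.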

Combinatorial code [LV08]: A second way to code the data is to code \(\boldsymbol{n}\) exactly,
and then, since there are n!/(n_1!...n_m!) sequences x_{1:n} with counts \(\boldsymbol{n}\), we can easily
construct a code of length log(n!/(n_1!...n_m!)) given \(\boldsymbol{n}\) by enumeration, i.e.

\[
\mathrm{CL}''(\boldsymbol{n}) := \log\frac{n!}{n_1!\cdots n_m!} + (m-1)\log n
\]

Within ±O(1) this code length also coincides with (1).



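Incremental code [WST97]: A third way is to use a sequential estimate

\[
\hat\theta_i^{\,t+1} = \frac{n_i^t + \alpha}{t + m\alpha}
\]

based on known past counts n_i^t = |{t' ≤ t : x_{t'} = i}|, where α > 0 is some regularizer.
Then

\[
P(x_{1:n}) = \hat\theta^{\,1}_{x_1}\cdot\ldots\cdot\hat\theta^{\,n}_{x_n}
= c_\alpha\,\frac{\prod_i \Gamma(n_i+\alpha)}{\Gamma(n+m\alpha)}, \qquad
c_\alpha := \frac{\Gamma(m\alpha)}{\Gamma(\alpha)^m} \tag{3}
\]

where Γ is the Gamma function. The negative logarithm of this expression again essentially
reduces to (1) (for any α > 0, typically 1/2 or 1), which can also be written as

\[
\mathrm{CL}'''(\boldsymbol{n}) = \ln\Gamma(n) - \sum_{i:n_i>0}\ln\Gamma(n_i) + O(1) \quad\text{if } n>0 \text{ and } 0 \text{ else.}
\]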
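
The equality between the sequential product and the closed form in (3) can be checked numerically. The Python sketch below is an illustration under the assumption that symbols are indexed 0,...,m-1 rather than 1,...,m; the function names are hypothetical.

```python
import math

def sequential_log2_prob(xs, m, alpha=0.5):
    """log2 of the product of sequential estimates (n_x + alpha) / (t + m*alpha),
    where the counts n_x and t refer to the symbols seen before the current one."""
    counts = [0] * m
    log_p = 0.0
    for t, x in enumerate(xs):          # t = number of symbols seen so far
        log_p += math.log2((counts[x] + alpha) / (t + m * alpha))
        counts[x] += 1
    return log_p

def closed_form_log2_prob(counts, alpha=0.5):
    """log2 of the right-hand side of eq. (3), evaluated via log-Gamma."""
    m, n = len(counts), sum(counts)
    ln_p = (math.lgamma(m * alpha) - m * math.lgamma(alpha)
            + sum(math.lgamma(c + alpha) for c in counts)
            - math.lgamma(n + m * alpha))
    return ln_p / math.log(2)

xs = [0, 1, 1, 2, 1, 0, 1, 1]           # a short sequence over m = 3 symbols
counts = [xs.count(i) for i in range(3)]
print(-sequential_log2_prob(xs, m=3), -closed_form_log2_prob(counts))
```

Both printed numbers are the code length of xs in bits; they agree up to floating-point error, and the result depends on xs only through its counts.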



Bayesian code [Sch78, Mac03]: A fourth (the Bayesian) way is to assume a
Dirichlet(α) prior over θ. The marginal distribution (evidence) is identical to (3)
and the Bayesian Information Criterion (BIC) approximation leads to code (1).

Conclusion: All four methods lead to essentially the same code length. The
references above contain rigorous derivations. In the following I will ignore the O(1)
terms and refer to (1) simply as the code length. Note that x_{1:n} is coded exactly
(lossless). Similarly (see MDP below) sampling models more complex than i.i.d.
may be considered, and the one that leads to the shortest code is selected as the
best model [Gru07].

MDP definitions. Recall that a sequence sra_{1:n} is said to be sampled from an
MDP (S,A,T,R) iff the probability of s_t only depends on s_{t-1} and a_{t-1}, and r_t only
on s_{t-1}, a_{t-1}, and s_t. That is,

\[
\begin{aligned}
P(s_t \mid h_{t-1}a_{t-1}) &= P(s_t \mid s_{t-1},a_{t-1}) =: T^{a_{t-1}}_{s_{t-1}s_t}\\
P(r_t \mid h_t) &= P(r_t \mid s_{t-1},a_{t-1},s_t) =: R^{a_{t-1}r_t}_{s_{t-1}s_t}
\end{aligned}
\]

In our case, we can identify the state-space S with the states s_1,...,s_n "observed"
so far. Hence S = {s^1,...,s^m} is finite and typically m ≪ n, since states repeat. Let
\(s \xrightarrow{a} s'(r')\) be shorthand for "action a in state s resulted in state s' (reward r')". Let
\(\mathcal{T}^{ar'}_{ss'} := \{t \le n : s_{t-1}=s,\ a_{t-1}=a,\ s_t=s',\ r_t=r'\}\) be the set of times t-1 at which \(s \xrightarrow{a} s'r'\),
and \(n^{ar'}_{ss'} := |\mathcal{T}^{ar'}_{ss'}|\) their number (\(n^{++}_{++} = n\)).

Coding MDP sequences. For some fixed s and a, consider the subsequence
s_{t_1}...s_{t_{n'}} of states reached from s via a (\(s \xrightarrow{a} s_{t_i}\)), i.e. \(\{t_1,...,t_{n'}\} = \mathcal{T}^{a*}_{s*}\), where \(n' = n^{a+}_{s+}\).
By definition of an MDP, this sequence is i.i.d. with s' occurring \(n'_{s'} := n^{a+}_{ss'}\) times.
By (1) we can code this sequence in CL(\(\boldsymbol{n}'\)) bits. The whole sequence s_{1:n} consists
of |S × A| i.i.d. sequences, one for each (s,a) ∈ S × A. We can join their codes and
get a total code length

\[
\mathrm{CL}(s_{1:n} \mid a_{1:n}) = \sum_{s,a} \mathrm{CL}(\boldsymbol{n}^{a+}_{s\bullet}) \tag{4}
\]

If instead of (1) we use the improved sparse code (2), non-occurring transitions
\(s \xrightarrow{a} s'r'\) will contribute only one bit rather than (1/2) log n bits to the code, so that large
but sparse MDPs get penalized less.

Similarly to the states we code the rewards. There are different "standard"
reward models. I consider only the simplest case of a small discrete reward set R
like {0,1} or {-1,0,+1} here and defer generalizations to ℝ and a discussion of
variants to the ΦDBN model [Hut09a]. By the MDP assumption, for each (s,a,s')
triple, the rewards at times \(\mathcal{T}^{a*}_{ss'}\) are i.i.d. Hence they can be coded in

\[
\mathrm{CL}(r_{1:n} \mid s_{1:n}, a_{1:n}) = \sum_{s,a,s'} \mathrm{CL}(\boldsymbol{n}^{a\bullet}_{ss'}) \tag{5}
\]

bits. In order to increase the statistics it might be better to treat r_t as a function of
s_t only. This is not restrictive, since dependence on s_{t-1} and a_{t-1} can be mimicked
by coding aspects into an enlarged state space.
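
Both code lengths can be read off transition and reward counts. The Python sketch below is a minimal illustration, assuming the sequences are given as plain lists with states[t-1], actions[t-1] leading to states[t] and rewards[t]; the function names are hypothetical, and the first time step is glossed over.

```python
import math
from collections import Counter

def code_length(counts):
    """Eq. (1): n*H(n/n) + (m-1)/2 * log2(n) for n > 0, and 0 else."""
    n = sum(counts)
    if n == 0:
        return 0.0
    entropy = -sum(c / n * math.log2(c / n) for c in counts if c > 0)
    return n * entropy + (len(counts) - 1) / 2 * math.log2(n)

def mdp_code_lengths(states, actions, rewards, reward_set):
    """Return (CL(s|a), CL(r|s,a)) following eqs. (4) and (5)."""
    state_set = list(dict.fromkeys(states))     # distinct observed states
    action_set = list(dict.fromkeys(actions))   # distinct observed actions
    trans, rew = Counter(), Counter()
    for t in range(1, len(states)):
        key = (states[t - 1], actions[t - 1], states[t])
        trans[key] += 1                          # count of s -a-> s'
        rew[key + (rewards[t],)] += 1            # count of s -a-> s' with reward r'
    cl_s = sum(code_length([trans[(s, a, s2)] for s2 in state_set])
               for s in state_set for a in action_set)
    cl_r = sum(code_length([rew[(s, a, s2, r)] for r in reward_set])
               for s in state_set for a in action_set for s2 in state_set)
    return cl_s, cl_r

states = [0, 1, 0, 1, 0, 1, 0, 1, 0]   # toy data: two alternating states
actions = ['a'] * len(states)
rewards = states[:]                    # toy choice: reward equals the state reached
cl_s, cl_r = mdp_code_lengths(states, actions, rewards, reward_set=(0, 1))
print(cl_s, cl_r)
```

In this deterministic toy sequence the entropy terms vanish, so both code lengths consist only of the parameter cost of the occupied (s,a) and (s,a,s') cells.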

Reward↔state trade-off. Note that the code for r depends on s. Indeed we
may interpret the construction as follows: Ultimately we/the agent cares about the
reward, so we want to measure how well we can predict the rewards, which we do
with (5). But this code depends on s, so we need a code for s too, which is (4). To
see that we need both parts consider two extremes.

A simplistic state transition model (small |S|) results in a short code for s. For
instance, for |S| = 1, nothing needs to be coded and (4) is identically zero. But this
obscures potential structure in the reward sequence, leading to a long code for r.

On the other hand, the more detailed the state transition model (large |S|) the
easier it is to predict and hence compress r. But a large model is hard to learn,
i.e. the code for s will be large. For instance for Φ(h) = h, no state repeats and the
frequency-based coding breaks down.
Φ selection principle. Let us define the Cost of Φ: H → S on h_n as the length of
the ΦMDP code for sr given a plus a complexity penalty CL(Φ) for Φ:

\[
\mathrm{Cost}(\Phi \mid h_n) := \mathrm{CL}(s_{1:n} \mid a_{1:n}) + \mathrm{CL}(r_{1:n} \mid s_{1:n}, a_{1:n}) + \mathrm{CL}(\Phi), \tag{6}
\]
where s_t = Φ(h_t) and h_t = ora_{1:t-1} o_t r_t.

The discussion above suggests that the minimum of the joint code length (4) and (5)
is attained for a Φ that keeps all and only relevant information for predicting rewards.
Such a Φ may be regarded as best explaining the rewards. I added an additional
complexity penalty CL(Φ) for Φ such that from the set of Φ that minimize (4)+(5)
(e.g. Φ's identical on (O × R × A)^n but different on longer histories) the simplest
one is selected. The penalty is usually some code-length or log-index of Φ. This
conforms with Ockham's razor and the MDL philosophy. So we are looking for a Φ
of minimal cost:

\[
\Phi^{best} := \arg\min_{\Phi}\,\{\mathrm{Cost}(\Phi \mid h_n)\} \tag{7}
\]



If the minimization is restricted to some small class of reasonably simple Φ, CL(Φ)
in (6) may be dropped. The state sequence generated by Φ^best (or approximations
thereof) will usually only be approximately MDP. While Cost(Φ|h) is an optimal
code only for MDP sequences, it still yields good codes for approximate MDP
sequences. Indeed, Φ^best balances closeness to MDP with simplicity. The primary
purpose of the simplicity bias is not computational tractability, but generalization
ability [Leg08, Hut05].



Relation to MDL et al. In unsupervised learning (clustering and density
estimation) and supervised learning (regression and classification), penalized
maximum likelihood criteria [HTF01, Chp.7] like BIC [Sch78], MDL [Gru07], and MML
[Wal05] have successfully been used for semi-parametric model selection. It is far
from obvious how to apply them in RL. Indeed, our derived Cost function cannot
be interpreted as a usual model+data code length. The problem is the following:

Ultimately we do not care about the observations but the rewards. The
rewards depend on the states, but the states are arbitrary in the sense that they are
model-dependent functions of the bare data (observations). The existence of these
unobserved states is what complicates matters, but their introduction is necessary
in order to model the rewards. For instance, Φ is actually not needed for coding
rs|a, so from a strict coding/MDL perspective, CL(Φ) in (6) is redundant. Since
s is some "arbitrary" construct of Φ, it is better to regard (6) as a code of r only.
Since the agent chooses his actions, a need not be coded, and o is not coded, because
they are only of indirect importance.

The Cost() criterion is strongly motivated by the rigorous MDL principle, but
invoked outside the usual induction/modeling/prediction context.
4 A Tiny Example

The purpose of the tiny example in this section is to provide enough insight into
how and why ΦMDP works to convince the reader that our Φ selection principle is
reasonable.

Example setup. I assume a simplified MDP model in which reward r_t only depends
on s_t, i.e.

\[
\mathrm{CL}(r_{1:n} \mid s_{1:n}, a_{1:n}) = \sum_{s'} \mathrm{CL}(\boldsymbol{n}^{+\bullet}_{+s'}) \tag{8}
\]

This allows us to illustrate ΦMDP on a tiny example. The same insight is gained
using (5) if an analogous larger example is considered. Furthermore I set CL(Φ) ≡ 0.

Consider binary observation space O = {0,1}, quaternary reward space R =
{0,1,2,3}, and a single action A = {0}. Observations o_t are independent fair coin
flips, i.e. Bernoulli(1/2), and reward r_t = 2o_{t-1} + o_t is a deterministic function of the two
most recent observations.

Considered features. As features Φ I consider Φ_k: H → O^k with Φ_k(h_t) = o_{t-k+1}...o_t
for various k = 0,1,2,... which regard the last k observations as "relevant". Intuitively
Φ_2 is the best observation summary, which I confirm below. The state space is S =
{0,1}^k (for sufficiently large n). The ΦMDPs for k = 0,1,2 are as follows.

[Figure: transition diagrams of Φ_0MDP (left), Φ_1MDP (middle), and Φ_2MDP (right).]

Φ_2MDP with all non-zero transition probabilities being 50% is an exact
representation of our data source. The missing arrows (directions) are due to the fact that
s = o_{t-1}o_t can only lead to s' = o'_t o'_{t+1} for which o'_t = o_t, denoted by s* = *s' in the
following. Note that ΦMDP does not "know" this and has to learn the (non)zero
transition probabilities. Each state has two successor states with equal probability,
hence generates (see previous paragraph) a Bernoulli(1/2) state subsequence and
a constant reward sequence, since the reward can be computed from the state =
last two observations. Asymptotically, all four states occur equally often, hence the
sequences have approximately the same length n/4.

In general, if s (and similarly r) consists of x ∈ ℕ i.i.d. subsequences of equal
length n/x over y ∈ ℕ symbols, the code length (4) (and similarly (8)) is

\[
\begin{aligned}
\mathrm{CL}(s \mid a;\, x_y) &= n\log y + x\,\frac{|\mathcal{S}|-1}{2}\,\log\frac{n}{x}\\
\mathrm{CL}(r \mid s,a;\, x_y) &= n\log y + x\,\frac{|\mathcal{R}|-1}{2}\,\log\frac{n}{x}
\end{aligned}
\]

where the extra argument x_y just indicates the sequence property. So for Φ_2MDP
we get

\[
\mathrm{CL}(s \mid a;\, 4_2) = n + 6\log\frac{n}{4} \quad\text{and}\quad \mathrm{CL}(r \mid s,a;\, 4_1) = 6\log\frac{n}{4}
\]

The log-terms reflect the required memory to code the MDP structure and
probabilities. Since each state has only 2 realized/possible successors, we need n bits to
code the state sequence. The reward is a deterministic function of the state, hence
needs no memory to code given s.

The Φ_0MDP throws away all observations (left figure above), hence CL(s|a; 1_1) = 0.
While the reward sequence is not i.i.d. (e.g. r_{t+1} = 3 cannot follow r_t = 0), Φ_0MDP
has no choice but to regard them as i.i.d., resulting in CL(r|s,a; 1_4) = 2n + (3/2) log n.

The Φ_1MDP model is an interesting compromise (middle figure above). The state
allows a partial prediction of the reward: State 0 allows rewards 0 and 2; state 1
allows rewards 1 and 3. Each of the two states creates a Bernoulli(1/2) state successor
subsequence and a binary reward sequence, wrongly presumed to be Bernoulli(1/2).
Hence CL(s|a; 2_2) = n + log(n/2) and CL(r|s,a; 2_2) = n + 3 log(n/2).

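Summary. The following table summarizes the results for k = 0,1,2 and
beyond:

\[
\begin{array}{c|c|c|c|c|c|c|c|c|c}
k & \mathcal{S} & |\mathcal{S}| & n^{0+}_{ss'} & n^{+r'}_{+s'} & n^{0+}_{s+} = n^{++}_{+s'} & s+r & \mathrm{CL}(s|a) & \mathrm{CL}(r|s,a) & \mathrm{Cost}(\Phi|h) \\ \hline
0 & \{\epsilon\} & 1 & n & n/4 & n & 1_1 + 1_4 & 0 & 2n + \frac{3}{2}\log n & 2n + \frac{3}{2}\log n \\
1 & \{0,1\} & 2 & n/4 & \frac{n}{4}\,\delta_{r'-s'=0|2} & n/2 & 2_2 + 2_2 & n + \log\frac{n}{2} & n + 3\log\frac{n}{2} & 2n + 4\log\frac{n}{2} \\
2 & \{00,01,10,11\} & 4 & \frac{n}{8}\,\delta_{s*=*s'} & \frac{n}{4}\,\delta_{r'=s'} & n/4 & 4_2 + 4_1 & n + 6\log\frac{n}{4} & 6\log\frac{n}{4} & n + 12\log\frac{n}{4} \\
\geq 2 & \{0,1\}^k & 2^k & \frac{n}{2^{k+1}}\,\delta_{s*=*s'} & \frac{n}{2^k}\,\delta_{r'=s'} & n/2^k & 2^k_2 + 2^k_1 & n + \frac{2^{2k}-2^k}{2}\log\frac{n}{2^k} & \frac{3}{2}\,2^k\log\frac{n}{2^k} & n + \frac{2^{2k}+2^{k+1}}{2}\log\frac{n}{2^k}
\end{array}
\]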

The notation of the s+r column follows the one used above in the text (x_y for s
and r). r' = s' means that r' is the correct reward for state s'. The last column is
the sum of the two preceding columns. The part linear in n is the code length for
the state/reward sequence. The part logarithmic in n is the code length for the
transition/reward probabilities of the MDP; each parameter needs (1/2) log n bits. For
large n, Φ_2 results in the shortest code, as anticipated. The "approximate" model
Φ_1 is just not good enough to beat the vacuous model Φ_0, but in more realistic
examples some approximate model usually has the shortest code. In [Hut09a] I
show on a more complex example how Φ^best will store long-term information in a
POMDP environment.
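
The ranking of the candidate features can also be checked empirically. The Python sketch below simulates the coin-flip source of this section and evaluates Cost(Φ_k|h) with CL(Φ) = 0 from observed counts, using code length (1) inside (4) and (8). The sample size, the random seed, and the function names are assumptions made for illustration, and the first few time steps are glossed over.

```python
import math
import random
from collections import Counter

def code_length(counts):
    """Eq. (1): n*H(n/n) + (m-1)/2 * log2(n) for n > 0, and 0 else."""
    n = sum(counts)
    if n == 0:
        return 0.0
    entropy = -sum(c / n * math.log2(c / n) for c in counts if c > 0)
    return n * entropy + (len(counts) - 1) / 2 * math.log2(n)

def cost_phi_k(obs, k, n_rewards=4):
    """Cost(Phi_k|h) = CL(s|a) + CL(r|s,a) via eqs. (4) and (8), CL(Phi) = 0."""
    start = max(k, 1)                      # first t where s_t and r_t both exist
    times = range(start, len(obs))
    states = [tuple(obs[t - k + 1:t + 1]) for t in times]   # last k observations
    rewards = [2 * obs[t - 1] + obs[t] for t in times]      # r_t = 2 o_{t-1} + o_t
    state_set = sorted(set(states))        # states observed so far
    trans = Counter(zip(states, states[1:]))                # (s, s') counts
    rew = Counter(zip(states, rewards))                     # (s', r') counts
    cl_s = sum(code_length([trans[(s, s2)] for s2 in state_set]) for s in state_set)
    cl_r = sum(code_length([rew[(s, r)] for r in range(n_rewards)]) for s in state_set)
    return cl_s + cl_r

random.seed(0)
obs = [random.randint(0, 1) for _ in range(20000)]   # fair coin flips
costs = {k: cost_phi_k(obs, k) for k in range(4)}
best = min(costs, key=costs.get)
print(costs, "best k =", best)
```

For this sample size the costs should come out close to the table entries: roughly 2n for k = 0 and k = 1, and roughly n plus a logarithmic term for k ≥ 2, with the logarithmic term growing quickly in k.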

5 Cost(Φ) Minimization

So far I have reduced the reinforcement learning problem to a formal Φ-optimization
problem. This section briefly explains what we have gained by this reduction, and
provides some general information about problem representations, stochastic search,
and Φ neighborhoods. Finally I present a simplistic but concrete algorithm for
searching context tree MDPs.

Φ search. I now discuss how to find good summaries Φ. The introduced generic
cost function Cost(Φ|h_n), based on only the known history h_n, makes this a
well-defined task that is completely decoupled from the complex (ill-defined)
reinforcement learning objective. This reduction should not be under-estimated. We can
employ a wide range of optimizers and do not even have to worry about overfitting.
The most challenging task is to come up with creative algorithms proposing Φ's.

There are many optimization methods. Most of them are search-based: random,
blind, informed, adaptive, local, global, population based, exhaustive, heuristic, and
other search methods [AL97]. Most are or can be adapted to the structure of the
objective function, here Cost(·|h_n). Some exploit the structure more directly (e.g.
gradient methods for convex functions). Only in very simple cases can the minimum
be found analytically (without search).

Most search algorithms require the specification of a neighborhood relation or
distance between candidate Φ's, which I define in the paragraph after next.

Problem representation can be important: Since Φ is a discrete function,
searching through (a large subset of) all computable functions is a non-restrictive
approach. Variants of Levin search [Sch04, Hut05], genetic programming
[Koz92, BNKF98], and recurrent neural networks [Pea89, RHHM08] are the major
approaches in this direction.

A different representation is as follows: Φ effectively partitions the history space
ℋ and identifies each partition with a state. Conversely any partition of ℋ can (up
to a renaming of states) uniquely be characterized by a function Φ. Formally, Φ
induces a (finite) partition ⋃_s {h' : Φ(h') = s} of ℋ, where s ranges over the codomain
of Φ. Conversely, any partition of ℋ = B_1 ∪ ... ∪ B_m induces a function Ψ(h') = i
iff h' ∈ B_i, which is equivalent to Φ apart from an irrelevant permutation of the
codomain (renaming of states).
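
The correspondence between feature maps and partitions can be illustrated on a finite set of histories. This is only a sketch under stated assumptions: the helper names `partition_from_phi` and `phi_from_partition` are hypothetical, histories are represented as short binary strings, and the real history space ℋ is of course not finite.

```python
from collections import defaultdict

def partition_from_phi(phi, histories):
    """Group histories into blocks {h : phi(h) = s}, one block per state s."""
    blocks = defaultdict(set)
    for h in histories:
        blocks[phi(h)].add(h)
    return dict(blocks)

def phi_from_partition(blocks):
    """Build Psi(h) = i iff h lies in block i (blocks: list of disjoint sets)."""
    index = {h: i for i, block in enumerate(blocks) for h in block}
    return lambda h: index[h]

# Hypothetical example: the state is the last observation of the history.
histories = ["00", "01", "10", "11"]
phi = lambda h: h[-1]
blocks = partition_from_phi(phi, histories)
psi = phi_from_partition(list(blocks.values()))

# phi and psi identify exactly the same pairs of histories,
# i.e. they agree up to a renaming of states.
agree = all((phi(a) == phi(b)) == (psi(a) == psi(b))
            for a in histories for b in histories)
print(blocks, agree)
```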

State aggregation methods have been suggested earlier for solving large-scale
MDP planning problems by grouping (partitioning) similar states together, resulting
in (much) smaller block MDPs [GDG03]. But the bi-simulation metrics used require
knowledge of the MDP transition probabilities, while our Cost criterion does not.

Decision trees/lists/grids/etc. are essentially space partitioners. The most
powerful versions are rule-based, in which logical expressions recursively divide domain
ℋ into "true/false" regions [DdRD01, SB09].

Φ neighborhood relation. A natural "minimal" change of a partition is to subdivide
(split) a partition or merge (two) partitions. Moving elements from one partition
to another can be implemented as a split and merge operation. In our case this
corresponds to splitting and merging states (state refinement and coarsening). Let Φ'
split some state s^a ∈ S of Φ into s^b, s^c ∉ S:

$$\Phi'(h) := \begin{cases} \Phi(h) & \text{if } \Phi(h) \neq s^a \\ s^b \text{ or } s^c & \text{if } \Phi(h) = s^a \end{cases}$$

where the histories mapped to state s^a are distributed among s^b and s^c according
to some splitting rule (e.g. randomly). The new state space is S' = S \ {s^a} ∪ {s^b, s^c}.
Similarly Φ' merges states s^b, s^c ∈ S into s^a ∉ S if

$$\Phi'(h) := \begin{cases} \Phi(h) & \text{if } \Phi(h) \notin \{s^b, s^c\} \\ s^a & \text{if } \Phi(h) = s^b \text{ or } s^c \end{cases}$$

where S' = S \ {s^b, s^c} ∪ {s^a}. We can regard Φ' as being a neighbor of or similar to Φ.
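
The split and merge operations can be written as wrappers around an existing feature map. The sketch below is illustrative only: `split_state` and `merge_states` are hypothetical names, histories are strings, and the splitting rule is an arbitrary deterministic predicate chosen for the example (a random rule would need to be fixed once per history so that Φ' remains a function).

```python
def split_state(phi, s_a, s_b, s_c, rule):
    """Return Phi' that sends histories of state s_a to s_b if rule(h) else s_c."""
    def phi_split(h):
        s = phi(h)
        if s != s_a:
            return s                      # all other states are unchanged
        return s_b if rule(h) else s_c
    return phi_split

def merge_states(phi, s_b, s_c, s_a):
    """Return Phi' that maps both s_b and s_c to the single state s_a."""
    def phi_merged(h):
        s = phi(h)
        return s_a if s in (s_b, s_c) else s
    return phi_merged

# Hypothetical example: refine state "1" by looking one observation further back.
phi = lambda h: h[-1]
phi_fine = split_state(phi, "1", "01", "11", rule=lambda h: h[-2] == "0")
phi_back = merge_states(phi_fine, "01", "11", "1")
print(phi_fine("001"), phi_fine("011"), phi_fine("110"), phi_back("011"))
```

Merging the two new states again recovers the original map, which is the sense in which a split and its inverse merge connect neighboring Φ's.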

Stochastic Φ search. Stochastic search is the method of choice for
high-dimensional unstructured problems. Monte Carlo methods can actually be highly
effective, despite their simplicity [Liu02, Fis03]. The general idea is to randomly
choose a neighbor Φ' of Φ and replace Φ by Φ' if it is better, i.e. has smaller Cost.
Even if Cost(Φ'|h) > Cost(Φ|h) we may keep Φ', but only with some (in the cost
difference exponentially) small probability. Simulated annealing is a version which
minimizes Cost(Φ|h). Apparently, Φ of small cost are (much) more likely to occur
than high cost Φ.
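
The acceptance rule just described can be sketched generically. This is a minimal illustration, not the author's implementation: `stochastic_search`, its parameters, and the fixed `temperature` are assumptions, and the toy problem at the end (integers with a quadratic cost) merely stands in for the space of feature maps and Cost(Φ|h).

```python
import random

def stochastic_search(start, cost, random_neighbor, steps, temperature=1.0, rng=None):
    """Local stochastic search: always accept improvements, and accept a worse
    neighbor with probability 2^(-delta/temperature), exponentially small in delta."""
    rng = rng or random.Random(0)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    for _ in range(steps):
        candidate = random_neighbor(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        if delta <= 0 or rng.random() < 2.0 ** (-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost

# Toy stand-in problem: candidates are integers, neighbors differ by one.
toy_cost = lambda x: (x - 7) ** 2
toy_neighbor = lambda x, rng: x + rng.choice((-1, 1))
best, best_cost = stochastic_search(0, toy_cost, toy_neighbor, steps=500)
print(best, best_cost)
```

Lowering `temperature` over time would turn this into a simulated annealing schedule; the constant value here is a simplification.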

Context tree example. The Φ in Section 4 depended on the last k observations.
Let us generalize this to a context dependent variable length: Consider a finite
complete suffix free set of strings (= prefix tree of reversed strings) S ⊂ O* as
our state space (e.g. S = {0,01,011,111} for binary O), and define Φ_S(h_n) := s iff
o_{n−|s|+1:n} = s ∈ S, i.e. s is the part of the history regarded as relevant. State splitting
and merging works as follows: For binary O, if history part s ∈ S of h_n is deemed
too short, we replace s by 0s and 1s in S, i.e. S' = S \ {s} ∪ {0s,1s}. If histories
1s, 0s ∈ S are deemed too long, we replace them by s, i.e. S' = S \ {0s,1s} ∪ {s}.
Large O might be coded binary and then treated similarly. For small O we have
the following simple Φ-optimizer:

ΦImprove(Φ_S, h_n)

[ Randomly choose a state s ∈ S;
  Let p and q be uniform random numbers in [0,1];
  if (p > 1/2) then split s, i.e. S' = S \ {s} ∪ {os : o ∈ O}
  else if {os' : o ∈ O} ⊆ S   (s' is s without the first symbol)
       then merge them, i.e. S' = S \ {os' : o ∈ O} ∪ {s'};
  if (Cost(Φ_S|h_n) − Cost(Φ_S'|h_n) > log(q)) then S := S';
[ return (Φ_S);

[Figure: example context tree with S = {0,01,011,111}]
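
For binary observations, one ΦImprove step can be sketched in Python as follows. This is a hedged illustration rather than a definitive implementation: the names `phi_suffix` and `phi_improve` are hypothetical, states are strings over the alphabet "01", the cost function is passed in by the caller (any stand-in for Cost(Φ_S|h_n)), and the case where neither a split nor a merge applies is assumed to leave S unchanged.

```python
import math
import random

def phi_suffix(S, observations):
    """Phi_S: return the element of S that is a suffix of the observation string."""
    for s in S:
        if observations.endswith(s):
            return s
    return None  # history still too short for every context in S

def phi_improve(S, cost, alphabet="01", rng=random):
    """One randomized split/merge step on the suffix set S (a frozenset of strings)."""
    s = rng.choice(sorted(S))
    p, q = rng.random(), rng.random()
    S_new = S                                   # assumption: default is "no change"
    if p > 0.5:
        S_new = (S - {s}) | {o + s for o in alphabet}        # split s
    elif len(s) > 0:
        siblings = {o + s[1:] for o in alphabet}             # all o s' with s' = s[1:]
        if siblings <= S:
            S_new = (S - siblings) | {s[1:]}                 # merge into s'
    # Accept if Cost(S) - Cost(S') > log2(q); log2(0) is treated as -infinity.
    threshold = math.log2(q) if q > 0 else -math.inf
    return S_new if cost(S) - cost(S_new) > threshold else S

S = frozenset({"0", "01", "011", "111"})
print(phi_suffix(S, "0010111"), phi_suffix(S, "1101"))
```

Because log(q) is negative, improvements are always accepted and a cost increase Δ is accepted with probability 2^(−Δ), which is the exponentially small acceptance probability of the stochastic search described above.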




The idea of using suffix trees as state space is from [McC96] (see also [Rin94]).
It might be interesting to compare the local split/merge criterion of [McC96] with
our general global Cost criterion. On the other hand, due to their limitation, suffix
trees are currently out of vogue.






6 Exploration & Exploitation 



Having obtained a good estimate Φ̂ of Φ^best in the previous section, we can/must
now determine a good action for our agent. For a finite MDP with known
transition probabilities, finding the optimal action is routine. For estimated probabilities
we run into the infamous exploration-exploitation problem, for which promising
approximate solutions have recently been suggested [SL08]. At the end of this section
I present the overall algorithm for our ΦMDP agent.

Optimal actions for known MDPs. For a known finite MDP (S,A,T,R,γ), the
maximal achievable ("optimal") expected future discounted reward sum, called (Q)
Value (of action a) in state s, satisfies the following (Bellman) equations [SB98]

$$Q^{*a}_s = \sum_{s'} T^a_{ss'}\,\big[R^a_{ss'} + \gamma V^*_{s'}\big] \quad\text{and}\quad V^*_s = \max_a Q^{*a}_s \qquad (9)$$

where 0 < γ < 1 is a discount parameter, typically close to 1. See [Hut05, Sec.5.7]
for proper choices. The equations can be solved by a simple (e.g. value or policy)
iteration process or various other methods or in guaranteed polynomial time by
dynamic programming [Put94]. The optimal next action is

$$a_n := \arg\max_a Q^{*a}_{s_n} \qquad (10)$$
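
Equations (9) and (10) can be solved by plain value iteration. The following is a minimal sketch under stated assumptions: T and R are given as nested lists indexed as `T[a][s][s2]`, the function names are hypothetical, and the two-state MDP at the end is an invented toy example, not one from this paper.

```python
def value_iteration(T, R, gamma, tol=1e-10, max_iter=100000):
    """Iterate Q[s][a] = sum_s2 T[a][s][s2] * (R[a][s][s2] + gamma * V[s2]),
    V[s] = max_a Q[s][a], until V changes by less than tol."""
    n_actions, n_states = len(T), len(T[0])
    V = [0.0] * n_states
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_iter):
        Q = [[sum(T[a][s][s2] * (R[a][s][s2] + gamma * V[s2])
                  for s2 in range(n_states))
              for a in range(n_actions)]
             for s in range(n_states)]
        V_new = [max(q) for q in Q]
        converged = max(abs(x - y) for x, y in zip(V, V_new)) < tol
        V = V_new
        if converged:
            break
    return Q, V

def greedy_action(Q, s):
    """Equation (10): the action with the largest Q value in state s."""
    return max(range(len(Q[s])), key=lambda a: Q[s][a])

# Toy MDP: action 0 stays, action 1 switches state; arriving in state 1 pays reward 1.
T = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
R = [[[0, 1], [0, 1]], [[0, 1], [0, 1]]]
Q, V = value_iteration(T, R, gamma=0.9)
print(V, greedy_action(Q, 0), greedy_action(Q, 1))
```

In this toy example the agent should switch from state 0 and stay in state 1, collecting reward 1 per step, so both values approach 1/(1 − γ) = 10.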

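Estimating the MDP. We can estimate the transition probability T by

$$\hat T^a_{ss'} := \frac{n^{a+}_{ss'}}{n^{a+}_{s+}} \ \text{ if } n^{a+}_{s+} > 0 \ \text{ and } 0 \text{ else.} \qquad (11)$$

It is easy to see that the Shannon-Fano code of s_{1:n} based on P_T̂(s_{1:n}|a_{1:n}) =
∏_{t=1}^n T̂^{a_{t−1}}_{s_{t−1}s_t} plus the code of the (non-zero) transition probabilities T̂^a_{ss'} to relevant
accuracy O(1/√(n^{a+}_{s+})) has exactly the length attributed to the state sequence in the
Cost function, i.e. the frequency estimate (11) is consistent
with the attributed code length. The expected reward can be estimated as

$$\hat R^a_{ss'} := \sum_{r' \in \mathcal{R}} \hat R^{ar'}_{ss'}\, r', \qquad \hat R^{ar'}_{ss'} := \frac{n^{ar'}_{ss'}}{n^{a+}_{ss'}} \qquad (12)$$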
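
The frequency estimates (11) and (12) amount to simple counting. A minimal sketch, assuming the experience is available as a list of (s, a, s', r') tuples; the function name `estimate_mdp` and the dictionary representation are hypothetical choices for illustration.

```python
from collections import Counter

def estimate_mdp(transitions):
    """Frequency estimates from (s, a, s_next, r_next) tuples.
    Returns dicts keyed by (s, a, s_next); missing keys mean an estimate of 0."""
    n_sas = Counter()     # n^{a+}_{ss'}: visits of s --a--> s' (any reward)
    n_sa = Counter()      # n^{a+}_{s+}:  times action a was taken in s
    r_sum = Counter()     # sum of rewards observed on s --a--> s'
    for s, a, s2, r in transitions:
        n_sas[(s, a, s2)] += 1
        n_sa[(s, a)] += 1
        r_sum[(s, a, s2)] += r
    T_hat = {k: n / n_sa[(k[0], k[1])] for k, n in n_sas.items()}
    # sum_r' (n^{ar'}_{ss'} / n^{a+}_{ss'}) r' is the mean reward on the transition.
    R_hat = {k: r_sum[k] / n for k, n in n_sas.items()}
    return T_hat, R_hat

transitions = [(0, "a", 1, 1), (0, "a", 1, 0), (0, "a", 0, 0), (1, "a", 0, 2)]
T_hat, R_hat = estimate_mdp(transitions)
print(T_hat, R_hat)
```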

Exploration. Simply replacing T and R in (9) and (10) by their estimates (11)
and (12) can lead to very poor behavior, since parts of the state space may never
be explored, causing the estimates to stay poor.

Estimate T̂ improves with increasing n^{a+}_{s+}, which can (only) be ensured by trying
all actions a in all states s sufficiently often. But the greedy policy above has no
incentive to explore, which may cause the agent to perform very poorly: The agent
stays with what he believes to be optimal without trying to solidify his belief. For
instance, if treatment A cured the first patient, and treatment B killed the second,
the greedy agent will stick to treatment A and not explore the possibility that B
may just have failed due to bad luck. Trading off exploration versus exploitation
optimally is computationally intractable [Hut05, PVHR06, RP08] in all but extremely
simple cases (e.g. Bandits [BF85, KV86]). Recently, polynomially optimal
algorithms (Rmax, E3, OIM) have been invented [KS98, BT02, SL08]: An agent is more
explorative if he expects a high reward in the unexplored regions. We can "deceive"
the agent to believe this by adding another "absorbing" high-reward state s^e to S,
not in the range of Φ(h), i.e. never observed. Henceforth, S denotes the extended
state space. For instance + in (11) now includes s^e. We set

$$n^a_{ss^e} = 1, \qquad n^a_{s^e s} = \delta_{s^e s}, \qquad R^a_{ss^e} = R^e_{\max} \qquad (13)$$

for all s,a, where exploration bonus R^e_max is polynomially (in (1−γ)^{−1} and |S×A|)
larger than max ℛ [SL08].

Now compute the agent's action by (9)-(12) but for the extended S. The optimal
policy p* tries to find a chain of actions and states that likely leads to the high reward
absorbing state s^e. Transition T̂^a_{ss^e} = 1/n^a_{s+} is only "large" for small n^a_{s+}, hence p*
has a bias towards unexplored (state,action) regions. It can be shown that this
algorithm makes only a polynomial number of sub-optimal actions.
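
The extension (13) can be sketched as a small transformation of the count and reward tables. This is an illustration under assumptions: counts and rewards are stored in dictionaries keyed by (s, a, s'), the names `add_exploration_state` and `transition_estimates` and the label "s_e" are hypothetical, and the value of R^e_max is simply supplied by the caller.

```python
def add_exploration_state(n_sas, R_hat, states, actions, r_e_max, s_e="s_e"):
    """Extend counts and rewards by an absorbing high-reward state s_e, as in (13)."""
    n_ext, R_ext = dict(n_sas), dict(R_hat)
    for a in actions:
        for s in list(states) + [s_e]:
            n_ext[(s, a, s_e)] = 1          # one fictitious visit s --a--> s_e
            R_ext[(s, a, s_e)] = r_e_max    # exploration bonus
    return n_ext, R_ext                      # s_e --a--> s has count 0 for s != s_e

def transition_estimates(n_ext):
    """Equation (11) on the extended counts: n[s,a,s'] / sum_s' n[s,a,s']."""
    totals = {}
    for (s, a, _), n in n_ext.items():
        totals[(s, a)] = totals.get((s, a), 0) + n
    return {k: n / totals[(k[0], k[1])] for k, n in n_ext.items()}

# Hypothetical counts: (state 0, action "a") tried 3 times, state 1 never tried.
n_ext, R_ext = add_exploration_state({(0, "a", 1): 3}, {(0, "a", 1): 0.5},
                                     states=[0, 1], actions=["a"], r_e_max=100.0)
T_ext = transition_estimates(n_ext)
print(T_ext[(0, "a", "s_e")], T_ext[(1, "a", "s_e")])
```

A state-action pair that has never been tried leads to s^e with estimated probability 1, while the probability shrinks as 1/n^a_{s+} once the pair has been explored, which is what biases the planner towards unexplored regions.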

The overall algorithm for our ΦMDP agent is as follows.

ΦMDP-Agent(A, ℛ)

[ Initialize Φ ≡ Φ' ≡ ε; S = {ε}; h_0 = a_0 = r_0 = ε;
  for n = 1,2,3,...
  [ Choose e.g. γ = 1 − 1/n;
    Set R^e_max = Polynomial((1−γ)^{−1}, |S×A|) · max ℛ;
    While waiting for o_n and r_n
    [ Φ' := ΦImprove(Φ', h_{n−1});
    [ If Cost(Φ'|h_{n−1}) < Cost(Φ|h_{n−1}) then Φ := Φ';
    Observe o_n and r_n; h_n := h_{n−1} a_{n−1} o_n r_n;
    s_n := Φ(h_n); S := S ∪ {s_n};
    Compute action a_n from Equations (9)-(13);
[ [ Output action a_n;
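
The control flow of the agent can be sketched as a Python skeleton. Everything here is a stand-in under stated assumptions: the four callables are hypothetical interfaces supplied by the caller, a fixed number of search steps replaces "while waiting for o_n and r_n", the choice of R^e_max is assumed to happen inside `plan`, and the history is kept as a flat list o_1 r_1 a_1 o_2 r_2 ...

```python
def phi_mdp_agent(env_step, improve, cost, plan, n_cycles, search_steps=10):
    """Skeleton of the agent loop (all callables are hypothetical stand-ins).

    env_step(action)                 -> (observation, reward)
    improve(phi, history)            -> candidate feature map (e.g. one PhiImprove step)
    cost(phi, history)               -> float (e.g. Cost(Phi|h))
    plan(phi, history, states, gamma) -> action (e.g. via Equations (9)-(13))
    """
    phi = candidate = (lambda history: ())    # trivial map: every history -> one state
    history, states, action = [], {()}, None
    for n in range(1, n_cycles + 1):
        gamma = 1.0 - 1.0 / n
        for _ in range(search_steps):         # search for a better feature map
            candidate = improve(candidate, history)
            if cost(candidate, history) < cost(phi, history):
                phi = candidate
        observation, reward = env_step(action)
        if action is not None:
            history.append(action)            # h_n := h_{n-1} a_{n-1} o_n r_n
        history += [observation, reward]
        states.add(phi(history))
        action = plan(phi, history, states, gamma)
    return phi, history

# Trivial stubs, only to show how the pieces are wired together.
phi, history = phi_mdp_agent(
    env_step=lambda action: (0, 1.0),
    improve=lambda phi, history: phi,
    cost=lambda phi, history: 0.0,
    plan=lambda phi, history, states, gamma: "a",
    n_cycles=3,
)
print(history)
```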

7 Improved Cost Function 

As discussed, we ultimately only care about (modeling) the rewards, but this
endeavor required introducing and coding states. The resulting Cost(Φ|h) function is
a code length of not only the rewards but also the "spurious" states. This likely
leads to a too strong penalty of models Φ with large state spaces S. The proper
Bayesian formulation developed in this section allows one to "integrate" out the states.
This leads to a code for the rewards only, which better trades off accuracy of the
reward model and state space size.

For an MDP with transition and reward probabilities T^a_{ss'} and R^{ar'}_{ss'}, the
probabilities of the state and reward sequences are

$$P_T(s_{1:n}|a_{1:n}) = \prod_{t=1}^{n} T^{a_{t-1}}_{s_{t-1}s_t}, \qquad P_R(r_{1:n}|s_{1:n}a_{1:n}) = \prod_{t=1}^{n} R^{a_{t-1}r_t}_{s_{t-1}s_t}$$

The probability of r|a can be obtained by taking the product and marginalizing s:

$$P_U(r_{1:n}|a_{1:n}) = \sum_{s_{1:n}} P_T(s_{1:n}|a_{1:n})\,P_R(r_{1:n}|s_{1:n}a_{1:n}) = \sum_{s_{1:n}} \prod_{t=1}^{n} U^{a_{t-1}r_t}_{s_{t-1}s_t} = \sum_{s_n} \big[U^{a_0r_1} \cdots U^{a_{n-1}r_n}\big]_{s_0s_n}$$

where for each a ∈ A and r' ∈ ℛ, matrix U^{ar'} ∈ ℝ^{m×m} is defined as [U^{ar'}]_{ss'} ≡ U^{ar'}_{ss'} :=
T^a_{ss'} R^{ar'}_{ss'}. The right n-fold matrix product can be evaluated in time O(m²n). This
shows that r given a and U can be coded in −log P_U bits. The unknown U needs to
be estimated, e.g. by the relative frequency Û^{ar'}_{ss'} := n^{ar'}_{ss'}/n^{a+}_{s+}. Note that P_U completely
ignores the observations o_{1:n} and is essentially independent of Φ. Map Φ and hence
o_{1:n} enter P_Û (only and crucially) via the estimate Û. The M := m(m−1)|A|(|ℛ|−1)
(independent) elements of Û can be coded to sufficient accuracy in (1/2) M log n bits,
and Φ will be coded in CL(Φ) bits. Together this leads to a code for r|a of length

$$\mathrm{ICost}(\Phi|h_n) := -\log P_{\hat U}(r_{1:n}|a_{1:n}) + \tfrac{1}{2} M \log n + \mathrm{CL}(\Phi) \qquad (14)$$

In practice, M can and should be chosen smaller, as done in the original Cost
function, and/or by using the restrictive model for R, and/or by considering
only non-zero frequencies. Analogous to (7) we seek a Φ that minimizes ICost().
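
The O(m²n) evaluation of the matrix product in P_U can be sketched with a forward recursion. This is a minimal illustration under assumptions: the matrices are passed as a dictionary `U[(a, r)]` of m×m nested lists, the start state s_0 is given as an index, the function names are hypothetical, and the running vector is rescaled at every step purely to avoid numerical underflow.

```python
import math

def neg_log2_likelihood(U, s0, actions, rewards):
    """-log2 P_U(r_1:n | a_1:n), with actions = a_0..a_{n-1} and rewards = r_1..r_n."""
    m = len(next(iter(U.values())))
    v = [1.0 if s == s0 else 0.0 for s in range(m)]   # unit row vector for s_0
    bits = 0.0
    for a, r in zip(actions, rewards):
        M = U[(a, r)]
        v = [sum(v[s] * M[s][s2] for s in range(m)) for s2 in range(m)]  # v := v U^{a r}
        norm = sum(v)
        if norm == 0.0:
            return math.inf                            # reward sequence impossible under U
        bits -= math.log2(norm)
        v = [x / norm for x in v]                      # rescale; the norms multiply to P_U
    return bits

def icost(U, s0, actions, rewards, n_params, cl_phi):
    """Equation (14): likelihood code + (1/2) M log n parameter code + CL(Phi)."""
    n = len(rewards)
    return (neg_log2_likelihood(U, s0, actions, rewards)
            + 0.5 * n_params * math.log2(n) + cl_phi)

# Toy two-state example: the state alternates deterministically and the reward
# reveals the state that was left (0 when leaving state 0, 1 when leaving state 1).
U = {("a", 0): [[0.0, 1.0], [0.0, 0.0]], ("a", 1): [[0.0, 0.0], [1.0, 0.0]]}
print(neg_log2_likelihood(U, 0, ["a"] * 4, [0, 1, 0, 1]))
```

Each step costs O(m²) operations, so the whole sequence is processed in O(m²n) time, matching the bound stated above.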

Since action evaluation is based on (discounted) reward sums, not individual
rewards, one may think of marginalizing P_U(r|a,Φ) even further, or coding rewards
only approximately. Unfortunately, the algorithms in Section 6 that learn, explore,
and exploit MDPs require knowledge of the (exact) individual rewards, so this
improvement is not feasible.



8 Discussion 

This section summarizes ΦMDP, relates it to previous work, and hints at more
efficient incremental implementations and more realistic structured MDPs (dynamic
Bayesian networks).

Summary. Learning from rewards in general environments is an immensely
complex problem. In this paper I have developed a generic reinforcement learning
algorithm based on sound principles. The key idea was to reduce general learning
problems to finite state MDPs for which efficient learning, exploration, and exploitation
algorithms exist. For this purpose I have developed a formal criterion for evaluating
and selecting good "feature" maps Φ from histories to states. One crucial property
of ΦMDP is that it neither requires nor learns a model of the complete observation
space, but only for the reward-relevant observations as summarized in the states.
The developed criterion has been inspired by MDL, which recommends selecting the
(coding) model that minimizes the length of a suitable code for the data at hand
plus the complexity of the model itself. The novel and tricky part in ΦMDP was
to deal with the states, since they are not bare observations, but model-dependent
processed data. An improved Bayesian criterion, which integrates out the states,
has also been derived. Finally, I presented a complete feature reinforcement learning
algorithm ΦMDP-Agent(). The building blocks and computational flow are depicted
in the following diagram:



[Figure: ΦMDP computational flow. Environment → (observation o, reward r) → History h
→ [Cost(Φ|h) minimization] → Feature Vec. Φ̂ → [frequency estimate] → Transition Pr. T̂,
Reward est. R̂ → [exploration bonus] → T̂^e, R̂^e → [Bellman] → (Q̂) Value → [implicit]
→ Best Policy p̂ → action a → Environment]



Relation to previous work. As already indicated here and there, ΦMDP can be
regarded as extending the frontier of many previous important approaches to RL
and beyond:

Partially Observable MDPs (POMDPs) are a very important generalization
of MDPs [KLC98]. Nature is still assumed to be an MDP, but the states
of nature are only partially observed via some non-injective or probabilistic
function. Even for finite state space and known observation and transition functions,
finding and even only approximating the optimal action is (harder than NP) hard
[LGM01, MHC03]. Lifting any of the assumptions causes conceptual problems, and
when lifting more than one we enter scientific terra nullius. Assume a POMDP
environment: POMDPs can formally (but not yet practically) be reduced to MDPs
over so-called (continuous) belief states. Since ΦMDP reduces every problem to
an MDP, it is conceivable that it reduces the POMDP to (an approximation of)
its belief MDP. This would be a profound relation between ΦMDP and POMDP,
likely leading to valuable insights into ΦMDP and proper algorithms for learning
POMDPs. It may also help us to restrict the space of potentially interesting features Φ.

Predictive State Representations (PSRs) are very interesting, but to this date in
an even less developed stage [SLJ+03] than POMDPs.

Universal AI [Hut05] is able to optimally deal with arbitrary environments, but the
resulting AIXI agent is computationally intractable [Hut07] and hard to approximate
[Pan08, PH06].

Bayesian RL algorithms [DFA99, Duf02, PVHR06, RP08] (see also [KV86, Chp.11]) can be
regarded as implementations of the AIξ models [PH06], which are down-scaled
versions of AIXI, but the enormous computational demand still severely limits this
approach. ΦMDP essentially differs from "generative" Bayesian RL and AIξ in that
it requires neither specifying nor learning a (complete stochastic) observation model.
It is a more "discriminative" approach [LJ08]. Since ΦMDP "automatically"
models only the relevant aspects of the environment, it should be computationally less
demanding than full Bayesian RL.

State aggregation methods have been suggested
earlier for solving large-scale MDP planning problems by grouping (partitioning)
similar states together, resulting in (much) smaller block MDPs [GDG03]. But the
bi-simulation metrics used require knowledge of the MDP transition probabilities.
ΦMDP might be regarded as an approach that lifts this assumption.

Suffix trees [McC96] are a simple class of features Φ. ΦMDP combined with a local
search function that expands and deletes leaf nodes is closely related to McCallum's
U-Tree algorithm [McC96], with a related but likely different split&merge criterion.

Miscellaneous: ΦMDP also extends the theory of model selection (e.g. MDL [Grü07])
from passive to active learning.

Incremental updates. As discussed in Section 5, most search algorithms are local
in the sense that they produce a chain of "slightly" modified candidate solutions,
here Φ's. This suggests a potential speedup by computing quantities of interest
incrementally, which becomes even more important in the ΦDBN case [Hut09a, Hut09c].

Computing Cost(Φ) takes at most time O(|S|²|A||ℛ|). If we split or merge
two states, we can incrementally update the cost in time O(|S||A||ℛ|), rather than
computing it again from scratch. In practice, many transitions T^a_{ss'} don't occur,
and Cost(Φ) can actually be computed much faster in time O(|{n^{ar'}_{ss'} > 0}|), and
incrementally even faster.

Iteration algorithms for (9) need an initial value for V or Q. We can take the
estimate V̂ from a previous Φ as an initial value for the new Φ. For a merge operation
we can average the value of both states; for a split operation we could give them the
same initial value. A significant further speedup can be obtained by using prioritized
iteration algorithms that concentrate their time on badly estimated states, which
are in our case (states close to) the new ones [SB98].

Similarly, results from cycle n can be (re)used for the next cycle n+1. For
instance, V̂ can simply be reused as an initial value in the Bellman equations, and
ICost(Φ) can be updated in time O(|S|²) or even faster if U is sparse.

Feature dynamic Bayesian networks. The use of "unstructured" MDPs, even
our Φ-optimal ones, is clearly limited to very simple tasks. Real world problems
are structured and can often be represented by dynamic Bayesian networks (DBNs)
with a reasonable number of nodes. Our Φ selection principle can be adapted from
MDPs to the conceptually much more complex DBN case. The primary purpose
of this Part I was to explain the key concepts on as simple a model as possible,
namely unstructured finite MDPs, to set the stage for developing the more realistic
ΦDBN in Part II [Hut09c].

Outlook. The major open problems are to develop smart Φ generation and smart
stochastic search algorithms for Φ^best, and to determine whether minimizing (14) is
the right criterion.

Acknowledgements. My thanks go to Pedro Ortega, Sergey Pankov, Scott Sanner,
Jürgen Schmidhuber, and Hanna Suominen for feedback on earlier drafts.

A List of Notation

Interface structures
  O                = finite or infinite set of possible observations
  A                = (small) finite set of actions
  ℛ                = {0,1} or [0,R_max] or other set of rewards
  n ∈ ℕ            = current time
  o_t r_t a_t      = ora_t ∈ O×ℛ×A = true observation, reward, action at time t

Internal structures for ΦMDP
  log              = binary logarithm
  t ∈ {1,...,n}    = any time
  i ∈ {1,...,m}    = any state index
  x_{1:n}          = x_1...x_n (any x)
  x_+, x_*, x_•    = Σ_j x_j, ⋃_j x_j, (x_1,...,x_l) (any x,j,l)
  x̂                = estimate of x (any x)
  ℋ                = (O×ℛ×A)*×O×ℛ = possible histories
  h_n              = ora_{1:n−1} o_n r_n = actual history at time n
  S                = {s^1,...,s^m} = internal finite state space (can vary with n)
  Φ                = state or feature summary of history
  s_t              = Φ(h_t) ∈ S = realized state at time t
  P(·)             = probability over states and rewards or parts thereof
  CL(·)            = code length
  MDP              = (S,A,T,R) = Markov Decision Process
  T^a_{ss'}        = P(s_t = s' | s_{t−1} = s, a_{t−1} = a) = transition matrix
  s →^a s'(r')     = action a in state s resulted in state s' (and reward r')
  𝒯^{ar'}_{ss'}    = set of times t ∈ {1,...,n} at which s →^a s'r'
  n^{ar'}_{ss'}    = |𝒯^{ar'}_{ss'}| = number of times t ∈ {1,...,n} at which s →^a s'r'
  Cost(Φ|h)        = cost (evaluation function) of Φ based on history h
  ICost(Φ|h)       = improved cost function
  Q*^a_s, V*_s     = optimal (Q) Value (of action a) in state s
  γ ∈ [0;1)        = discount factor ((1−γ)^{−1} is effective horizon)